This project is intended for CAP6768 - Data Analytics - Discussion 6
For this discussion activity, perform the following tasks:
Note: Read other students’ posts before selecting your dataset for this discussion activity. If another student has chosen a dataset, then you will have to choose a different dataset.
State the investigative question in relevance to the selected dataset and regression analysis. State the relationships between variables that you intend to investigate.
Perform appropriate regression analysis to conduct the stated investigation. Explain the tooling used and processes used for performing regression analysis.
Provide the regression analysis results and explain the findings in relevance to the investigative questions posed.
For this discussion, I’ve choosen the dataset “Spread of Shotgun Pellets by Distance of Shot”. This dataset looks at the measured spread of shotgun pellets for different distances from a target.
We start by loading the downloaded datafile and renaming the columns to match the description.
We display to resulting dataset below.
shotgun_data <- read.table("shotgun_spread.dat", header = FALSE, skip = 3)
colnames(shotgun_data) <- c("cartridge", "range", "sqrt_spread", "std_of_sqrt_spread_for_range")
shotgun_data <- shotgun_data %>%
mutate(
cartridge = case_when(
cartridge == 1 ~ "Winchester Western Super X 00",
cartridge == 2 ~ "Remington No. 4"
)
)
shotgun_data %>% datatable()
In order to better understand the dataset, we start by plotting the data onto a scatterplot.
shotgun_data %>%
plot_ly(x = ~range, y = ~sqrt_spread, color = ~cartridge) %>%
layout(
title = "Square Root of the Spead verses Range",
xaxis = list(title = "Range"),
yaxis = list(title = "Sqrt. Spread")
)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
As we can see, there is a clear positive correlation between the range and the square root of the spread. We also notice that the increase in range correlates withe a wider variance of values for the sqrt of the spread. We can show this by looking at the std deviation of the sqrt of the spread for each range. Luckily, this is already provided in the dataset.
shotgun_data %>%
plot_ly(x = ~range, y = ~std_of_sqrt_spread_for_range, color = ~cartridge) %>%
layout(
title = "Std. of Sqrt of Spread verses Range",
xaxis = list(title = "Range"),
yaxis = list(title = "Std. of Sqrt of Spread")
)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
As we can see, there is an exponential relation between the Range and the square root of the spread.
We will use regression to measure the correlation between these variables.
model_1 <- shotgun_data %>%
lm(sqrt_spread ~ range, data = .)
shotgun_data %>%
plot_ly(x = ~range) %>%
add_markers(y = ~sqrt_spread, color = ~cartridge) %>%
add_lines(x = ~range, y = fitted(model_1), name = "Regression") %>%
layout(
title = "Square Root of the Spread verses Range",
xaxis = list(title = "Range"),
yaxis = list(title = "Sqrt. Spread")
)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
summary(model_1)
##
## Call:
## lm(formula = sqrt_spread ~ range, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8804 -0.6920 -0.1348 0.5280 5.4496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.13357 0.50110 -0.267 0.791
## range 0.29628 0.01476 20.073 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.529 on 55 degrees of freedom
## Multiple R-squared: 0.8799, Adjusted R-squared: 0.8777
## F-statistic: 402.9 on 1 and 55 DF, p-value: < 2.2e-16
We are also able to fit an exponential regression to each of the spread outputs. We will create two different models for each of the cartridge types.
cartridge1_data <- shotgun_data %>% filter(cartridge == 'Winchester Western Super X 00')
cartridge2_data <- shotgun_data %>% filter(cartridge == 'Remington No. 4')
model_2 <- cartridge1_data %>%
lm(std_of_sqrt_spread_for_range ~ range, data = .)
model_3 <- cartridge2_data %>%
lm(std_of_sqrt_spread_for_range ~ range, data = .)
Cartridge 1, Winchester Western Super X 00 Regression Model
cartridge1_data %>%
plot_ly(x = ~range) %>%
add_markers(y = ~std_of_sqrt_spread_for_range, color = ~cartridge) %>%
add_lines(x = ~range, y = fitted(model_2), name = "Regression") %>%
layout(
title = "Std. of the Spread verses Range",
xaxis = list(title = "Range"),
yaxis = list(title = "Std. of the Spread")
)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
summary(model_2)
##
## Call:
## lm(formula = std_of_sqrt_spread_for_range ~ range, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14863 -0.10080 -0.05646 0.16403 0.28371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.663886 0.079937 -8.305 1.18e-08 ***
## range 0.049917 0.002297 21.732 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1569 on 25 degrees of freedom
## Multiple R-squared: 0.9497, Adjusted R-squared: 0.9477
## F-statistic: 472.3 on 1 and 25 DF, p-value: < 2.2e-16
cartridge2_data %>%
plot_ly(x = ~range) %>%
add_markers(y = ~std_of_sqrt_spread_for_range, color = ~cartridge) %>%
add_lines(x = ~range, y = fitted(model_3), name = "Regression") %>%
layout(
title = "Std. of the Spread verses Range",
xaxis = list(title = "Range"),
yaxis = list(title = "Std. of the Spread")
)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
summary(model_3)
##
## Call:
## lm(formula = std_of_sqrt_spread_for_range ~ range, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3450 -0.0762 0.0426 0.0924 0.2862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.544400 0.092351 -5.895 2.43e-06 ***
## range 0.067120 0.002784 24.105 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2157 on 28 degrees of freedom
## Multiple R-squared: 0.954, Adjusted R-squared: 0.9524
## F-statistic: 581 on 1 and 28 DF, p-value: < 2.2e-16